Abstract


In a distributed computing environment, it is important to ensure that the processor work- 
loads are adequately balanced. Among numerous load-balancing algorithms, a unique ap- 
proach due to Das and Prasad defines a symmetric broadcast network (SBN) that provides a 
robust communication pattern among the processors in a topology-independent manner. In this 
paper, we propose and analyze three efficient SBN-based dynamic load-balancing algorithms, 
and implement them on an SGI Origin2000. A thorough experimental study with Poisson-distributed
synthetic loads demonstrates that our algorithms are effective in balancing system
load. By optimizing completion time and idle time, the proposed algorithms are shown to 
compare favorably with several existing approaches. 

Key words: Dynamic load balancing; network topology; job migration; performance study. 

1 Introduction 

To maximize the performance of a multicomputer system, it is essential to evenly distribute the 
workload among the available processors. In other words, it is desirable to prevent, if possible, 
the condition where one processor is overloaded with a backlog of jobs to be serviced while an- 
other processor is lightly loaded or even idle. The load-balancing problem is closely related to 
scheduling and resource allocation, and can be either static or dynamic. In static load balancing 
algorithms [18], the decisions are made at compile time when resource requirements are estimated. 
On the other hand, dynamic algorithms [2, 7, 11, 12, 18, 23] allocate/reallocate resources at run- 
time based on a set of system parameters, which may determine when jobs can be migrated and 
also account for the associated overhead in such a transfer [17]. Determining the parameters to 
be maintained and how to broadcast them among processors are important design considerations, 
normally resolved by distributed scheduling policies [10, 13]. 

This paper deals with decentralized load balancing in distributed-memory multicomputers in 
which processors are connected by a point-to-point network topology and communicate with one 
another via message passing. The network is assumed to be homogeneous and any job can be ser- 
viced by any processor; however, jobs cannot be rerouted once execution begins. We have confined 
the scope of this paper to independent tasks only. The workload of a processor is determined by 
the length of its local job queue. 

In particular, we propose three efficient dynamic load-balancing algorithms which make use of 
a logical and topology-independent communication pattern among processors, called a symmetric 
broadcast network (SBN), introduced in [5, 6]. These algorithms will be called (i) Basic SBN
algorithm, (ii) Hypercube Variant, and (iii) Heuristic Variant. 

SBN-based load balancing can be initiated by any processor that has too many or too few jobs 
to process based on certain threshold values. Balancing messages are first broadcast so that the
current system load of jobs can be estimated. This is followed by distribution messages which reassign
jobs to minimize the possibility of processors becoming idle. SBN-based load balancing runs
concurrently with application processing. A topology-independent logical network, such
as an SBN, helps provide predictable communication patterns to applications that make use of wide 
area networks of processors for both load balancing and interprocessor communication. An SBN 
is also effective when implementing message-passing applications for multicomputer systems in a 
portable manner. 

The SBN topology can easily be embedded into specific networks if efficiency is an issue. For 
example, one of our SBN-based algorithms (Hypercube Variant) naturally adapts to a hypercube 
topology, and is thus used to compare with other hypercube-based dynamic load-balancing meth- 
ods. This helps us determine whether the measured performance improvements are due to the 
effects of the network topology or the proposed load-balancing schemes themselves. 

Based on their operational characteristics, our SBN-based algorithms can be classified (accord- 
ing to [19]) as: (a) Adaptive, since the performance adapts to the average number of queued jobs; 
(b) Symmetrically Initiated, since both senders and receivers can initiate load balancing; (c) Stable,
since the network is not burdened with excessive load-balancing traffic; and (d) Effective, since
system performance does not degrade while balancing workloads. 

The performance of the proposed SBN-based algorithms is analyzed by conducting a thorough 
experimental study on a 32-processor SGI Origin2000 machine, using the Message-Passing
Interface (MPI) paradigm. A preliminary version of this work that describes experiments running
on an IBM SP2 is available in [3]. Investigating the programming paradigm is beyond the scope 
of this paper. Since the SBN strategy is topology and architecture independent, it made sense to 
use a portable library like MPI. Furthermore, any benefit provided by exploiting the Origin2000
shared-memory architecture would be equally available to all the other load-balancing schemes. 

We use Poisson-distributed synthetic loads and compare the performance with other methods 
such as Random [7], Gradient [14, 15], Sender Initiated [8], Receiver Initiated [8], Adaptive
Contracting [8], and Tree Walking [20], as summarized in Section 2. Our experiments demonstrate
that the quality of load balancing achieved by the SBN approach compares favorably with respect 
to four metrics: (i) message traffic per processor, (ii) total jobs transferred, (iii) maximum variance 
in processor idle time, and (iv) total completion time. For example, under heavy system loads, 
the SBN algorithms complete in 6% to 22% less time than other balancing algorithms that are 
analyzed in this paper. Idle time is also reduced by over 67%. Under light system loads, the SBN- 
based algorithms incur significantly less message traffic as compared to other popular balancing 
algorithms such as Gradient and Receiver Initiated. 

This paper is organized as follows. Section 2 reviews several existing approaches to dynamic 
load balancing that are used as comparisons. Section 3 defines SBNs, presents some of their 
properties, and discusses the general characteristics of our proposed load-balancing algorithms. 
Section 4 presents three SBN-based load-balancing schemes and analyzes their performance
characteristics. Section 5 contains experimental results that compare the SBN algorithms to other
load-balancing techniques. The final section concludes the paper. 


2 Related Work 

A wide variety of dynamic load-balancing algorithms have been proposed for improving
multiprocessor performance [2, 7, 8, 9, 11, 12, 14, 15, 18, 20, 21, 22, 23, 24]. Let us first summarize
the underlying characteristics of some of the most popular methods which are used to compare the 
performance of our SBN-based algorithms. 

Random [7]: If the number of jobs queued at a given processor is larger than a certain threshold, 
additional jobs are randomly distributed among its neighbors. Although a single distribution 
message may contain several jobs, a particular job cannot be migrated multiple times. In 
other words, once a job is migrated, it is queued for processing. 

Gradient [14, 15]: Jobs migrate from overloaded to lightly-loaded processors based on a sys- 
temwide gradient. Each processor maintains, for each of its immediate neighbors, the min- 
imum number of communication hops to the nearest lightly-loaded processor. Whenever 
these values change, they are broadcast to all the neighbors. However, because of network 
dynamics, this is only an approximation to the true system load. Each processor also has a 
load status flag which, by comparison with system thresholds, determines whether the pro- 
cessor is overloaded, lightly loaded, or moderately loaded. Jobs are routed to the neighbor 
lying on the path to the nearest lightly-loaded processor, and a job can migrate several times 
before being finally processed. 

Receiver Initiated [8]: Load balancing is triggered by a processor with load level below the 
system threshold. The lightly-loaded processor broadcasts to its neighbors a job request 
message which contains information about its queue. Upon receiving this message, each 
neighbor compares its own queue length to that of the requesting processor. If the local 
queue size is larger, the neighboring processor replies with a single job. To prevent instability 
under light system load conditions, the requesting processor waits for a specified amount of 
time for a reply before initiating another job request. Like the Gradient scheme, it is possible 
for a job to be migrated multiple times before being finally processed. 

Sender Initiated [8]: Unlike Receiver Initiated, load balancing occurs when processors become 
overloaded. To prevent instability under heavy system loads, each processor exchanges load 
information with its neighbors. More precisely, load values are exchanged when a local 
queue length is halved or doubled, so job migrations occur less frequently than system load
changes. Additional jobs are distributed to lightly-loaded neighbors. Like Random, multiple
job migration is not allowed. 

Adaptive Contracting [8]: When jobs arrive at a processor, it distributes bids to all of its
neighbors. The neighbors respond to the bid with a message containing the number of jobs in their
respective local queues. The originating processor then distributes jobs to those neighbors 
that have loads smaller than the system threshold. The number of jobs migrated is such that 
jobs are equally distributed among the originating processor and its lightly-loaded neighbors. 

Tree Walking [20]: Utilizing a binary tree topology, the fixed root processor initiates a load- 
balancing operation when one of the processors becomes idle. Processing is temporarily 
suspended when load balancing is underway. First, a balance message is broadcast through 
the network. Processors respond by sending their current load level back towards the root 
using a global reduction operation. The correct systemwide load level is then broadcast. 
Finally, jobs are distributed so that every processor has an equal number of jobs. 

Despite some similarities between our SBN-based approach (details given later in Section 4) 
and the Tree Walking Algorithm (TWA), there are several major differences as outlined 
below: 

• TWA always initiates load balancing from a single root processor. It is, therefore, more 
restrictive and has less potential for being fault tolerant than SBN, which allows any 
processor to initiate load balancing. 

• Application processing is temporarily suspended when TWA executes because it is 
unable to mask balancing overhead by overlapping processing and load balancing. In 
contrast, SBN balancing proceeds concurrently with application processing. 

• TWA is designed to perfectly equalize the job count at every processor leading to more 
message passing. SBN balancing attempts to minimize the possibility that a particular 
processor will become idle; thus, a perfectly equalized job count is unnecessary. 

• TWA triggers load balancing only when processors become idle. SBN anticipates idle 
conditions and triggers balancing ahead of time. 

• TWA requires communication messages to be broadcast to all processors when bal- 
ancing the system. The SBN Heuristic Variant allows balancing to be accomplished 
without this requirement. 

All of these algorithms can be classified as being iterative in nature because they strive to reach 
a global balanced state through successive nearest-neighbor operations by individual processors. 
Iterative methods, in general, are a good match for direct point-to-point interconnection networks 
commonly used for communications in modern multicomputers. Furthermore, they are flexible,
preserve communication locality, and are inherently scalable. 

There are two main classes of iterative methods: diffusion [2, 21, 23] and dimension
exchange [2, 23, 24]. Diffusion algorithms require each processor to communicate simultaneously with
all of its nearest neighbors to reach a local load balance. On the other hand, in dimension exchange,
a processor balances workload with one neighbor at a time. Diffusion algorithms are more
popular because of their simplicity; however, their efficiency depends on a parameter that determines
the behavior of the local balancing operation. In [12], an improved diffusion algorithm based on 
Chebyshev polynomials was proposed. Results showed that performance was better than the base- 
line diffusion method, but at the additional cost of calculating two eigenvalues. The drawback of 
the dimension exchange method is that equidistribution of the workload between a pair of pro- 
cessors at each balance operation is not necessarily efficient. It has therefore been generalized by 
using an exchange parameter to control the workload distribution between pairs of processors [22]. 
For processor networks that can be represented as a cartesian product of graphs, [9] proposed an 
alternating-direction dimension exchange scheme that reduces the number of iterations but at the 
cost of significantly greater job transfers. A refined scheme switches directions every other itera- 
tion to lower the amount of job migration. 
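
The difference between the two iterative classes can be made concrete with a small sketch. The code below is only an illustration on a hypercube whose nodes are numbered so that neighbors differ in one bit; the function names, the fractional loads, and the diffusion parameter `alpha` are assumptions made for this example and are not taken from [2, 21, 23, 24].

```python
def hypercube_dimension(p):
    """Return d such that p == 2**d, or raise if p is not a power of two."""
    d = p.bit_length() - 1
    if p < 1 or 1 << d != p:
        raise ValueError("number of processors must be a power of two")
    return d


def diffusion_step(loads, alpha):
    """One diffusion step: every node exchanges with all neighbors at once.

    alpha is the parameter that controls how much of each load difference
    is moved (an illustrative name, not the papers' notation).
    """
    d = hypercube_dimension(len(loads))
    new_loads = list(loads)
    for i in range(len(loads)):
        for k in range(d):
            j = i ^ (1 << k)  # neighbor across dimension k
            new_loads[i] += alpha * (loads[j] - loads[i])
    return new_loads


def dimension_exchange_sweep(loads):
    """One sweep of dimension exchange: pairs average one dimension at a time."""
    d = hypercube_dimension(len(loads))
    loads = [float(x) for x in loads]
    for k in range(d):
        for i in range(len(loads)):
            j = i ^ (1 << k)
            if i < j:  # handle each pair once
                loads[i] = loads[j] = (loads[i] + loads[j]) / 2.0
    return loads
```

With divisible (fractional) loads, one sweep over all d dimensions equalizes a hypercube, whereas a diffusion step only moves the loads part of the way; real job queues are integral, so either scheme would need rounding.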

3 Preliminaries 

In this section, we define the concept of symmetric broadcast network (SBN), describe its prop- 
erties, and present the general characteristics of our proposed load-balancing algorithms based on 
SBNs. 

3.1 Definition of SBN 

An SBN defines a communication pattern (logical or physical) among the $P$ processors in a
multicomputer system [5, 6]. An SBN($d$) of dimension $d \ge 0$ is a $(d+1)$-stage interconnection
network with $P = 2^d$ processors in each stage. It is constructed recursively as follows:

• A single processor forms the basis network SBN(0) consisting of a single stage, denoted as
stage 0, with no communication link.

• For $d > 0$, an SBN($d$) is obtained from a pair of SBN($d-1$)s as follows:

(a) Keep the processor labels in the first SBN($d-1$) unchanged as 0 through $2^{d-1} - 1$; relabel
the processors of the second SBN($d-1$) as $2^{d-1}$ through $2^d - 1$.

(b) Create an additional communication stage $d$, containing processors 0 through $2^d - 1$.
Connect each processor $j$ in stage $d-1$ to processor $(j + 2^{d-1}) \bmod 2^d$ in stage $d$.

Figure 1: Construction of SBN(3) from a pair of SBN(2)s. Dashes indicate the new connections.

(c) If stage $d-2$ exists, for each processor $j$ in stage $d-2$, define $k$ to be the successor of
$j$ in stage $d-1$. Likewise, define $m$ to be the successor of $k$ in stage $d$ (as just created in
step (b) above). Connect processor $j$ in stage $d-2$ to processor $m$ in stage $d-1$.

An example of how an SBN(3) is formed from two SBN(2)s is shown in Fig. 1. An SBN defines
unique communication patterns among the processors in the network. For any source processor at
stage $d$ of SBN($d$), there are $d = \log P$ stages of communication where each processor appears
exactly once. The successors and predecessors of each processor in a given stage i are uniquely 
defined by specifying the label of the originating processor and the communication stage number. 
Messages originating from source processors are appropriately routed through the SBN. 

3.2 Properties of SBN 

Among a total of eight possible communication patterns in SBN(3), consider the two patterns 
shown in Fig. 2. The paths in Fig. 2(a) are used to route messages originating from processor 0, 
while those in Fig. 2(b) are for messages originating from processor 5. Let $n_5^s$ denote a
processor label at stage $s$ in Fig. 2(b) and let $n_0^s$ be the corresponding processor label in Fig. 2(a). Then,
$n_5^s = n_0^s \oplus 5$, where $\oplus$ is the exclusive-OR operator. In general, if $n_x^s$ is the corresponding processor
in the communication pattern for messages originating from source processor $x$, then $n_x^s = n_0^s \oplus x$.
Thus, all SBN communication patterns can be derived from the template pattern having
processor 0 as the root. The predecessor and two successors of $n_0^s$ can then be computed as follows:

Predecessor: $((n_0^s - 2^s) \vee 2^{s+1}) \bmod 2^d$, if $0 \le s < d$ ($\vee$ is the inclusive-OR operator),


Figure 2: Two communication patterns in SBN(3), each drawn with columns labeled Stage 3 through Stage 0.

Successor_1: $n_0^s + 2^{s-1}$, if $1 \le s < d$,

Successor_2: $n_0^s - 2^{s-1}$, if $1 \le s < d$.
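
The template formulas and the exclusive-OR rule can be checked with a short sketch. The function names below are illustrative, and the single successor of the root at stage d is taken from construction step (b) of Section 3.1 rather than from the formulas above, which cover only stages below d.

```python
def template_successors(n, s, d):
    """Successors of processor n at stage s in the root-0 template of SBN(d)."""
    if s == 0:
        return []                                   # last stage: no successors
    if s == d:
        return [(n + 2 ** (d - 1)) % 2 ** d]        # root: one successor, step (b)
    return [n + 2 ** (s - 1), n - 2 ** (s - 1)]     # Successor_1, Successor_2


def template_predecessor(n, s, d):
    """Predecessor of processor n at stage s (0 <= s < d) in the root-0 template."""
    if not 0 <= s < d:
        raise ValueError("predecessor is defined only for 0 <= s < d")
    return ((n - 2 ** s) | 2 ** (s + 1)) % 2 ** d


def pattern(root, d):
    """Map each stage to the processors visited by messages from `root`."""
    stages = {d: [0]}
    for s in range(d, 0, -1):
        stages[s - 1] = [m for n in stages[s]
                         for m in template_successors(n, s, d)]
    # Every pattern is the root-0 template XORed with the source label.
    return {s: [n ^ root for n in nodes] for s, nodes in stages.items()}
```

For SBN(3), `pattern(0, 3)` visits processor 0, then 4, then 6 and 2, then 7, 5, 3, and 1, and `pattern(5, 3)` is the same structure with every label XORed with 5.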

Figure 2 illustrates two possible SBN communication patterns, but many others can easily be 
derived by slightly altering the SBN definition to match a given network topology and application 
requirements. Multiple randomly-selected SBN patterns help distribute messages more evenly, 
enhance network reliability, and allow various applications to be written using different communi- 
cation patterns. For example, the SBN communication pattern for processor 0 can be defined with 
the help of a one-dimensional array implementation of a full binary tree such that the predecessor 
and successors of a processor are given by: 

Predecessor: $\lfloor n_0^s / 2 \rfloor$, if $0 \le s < d$,

Successor_1: $2 \times n_0^s + 1$, if $1 \le s < d$,

Successor_2: $2 \times n_0^s$, if $1 \le s < d$.
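
A minimal sketch of this alternative pattern follows, using the usual array indexing of a full binary tree. It assumes that the root (processor 0 at stage d) has processor 1 as its only successor, which is not stated above but is consistent with the predecessor rule; the function names are illustrative.

```python
def tree_pattern(d):
    """Stages of a binary-tree style SBN(d) pattern rooted at processor 0."""
    stages = {d: [0]}
    if d >= 1:
        stages[d - 1] = [1]          # assumed single successor of the root
    for s in range(d - 1, 0, -1):
        # Each processor n at stage s has successors 2n + 1 and 2n.
        stages[s - 1] = [c for n in stages[s] for c in (2 * n + 1, 2 * n)]
    return stages


def tree_predecessor(n):
    """Predecessor of processor n (n >= 1) in the array-based binary tree."""
    return n // 2
```

As with the template pattern, every processor appears exactly once, so the same balancing and distribution logic could be routed over it.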

Let us demonstrate how to embed a hypercube topology within an SBN, as will be required 
for the Hypercube Variant of our load-balancing algorithm. This embedding uses a modified
binomial spanning tree, which consists of two binomial trees¹ connected back-to-back. Figure 3
shows such a communication pattern for a 16-processor network which is used to route messages 
originating from processor 0. The solid lines of the diagram represent the actual SBN(4) pattern, 
whereas the dashed lines are used to gather load-balancing messages at a single destination proces- 
sor (processor 15, in this case). This embedding ensures that all successors and predecessors at any 
communication stage are adjacent processors in the hypercube. Also, every originating processor 
has a unique destination. Finally, if the processors are numbered using a binary string of d bits, 
the number of predecessors for a processor x is max {1, b}, where b is the number of consecutive 
leftmost 1-bits in the binary address of x. 

¹ A binomial tree $B(k)$ of $2^k$ nodes is an ordered tree defined recursively as follows [1]: $B(0)$ consists of a single
node; $B(k)$ consists of two $B(k-1)$s linked together such that the root of one is the leftmost child of the root of the
other.

Figure 3: Modified binomial spanning tree used as a hypercube SBN(4), drawn with columns labeled Stage 4 through Stage 0.

3.3 General Characteristics of Proposed Load Balancers 

We discuss below some general features that are shared by all three SBN-based adaptive
load-balancing algorithms presented in Section 4. We also describe various system thresholds, the two
types of messages that are processed by the SBN approach, and pseudo code for the procedures
common to all three algorithms.

3.3.1 System Thresholds 

In general, many load-balancing algorithms are very susceptible to the choice of system thresh- 
olds [16]. A proper selection of these threshold values has proven helpful in optimizing our algo- 
rithms as well. 

The proposed SBN-based algorithms adapt their behavior to the system load. Under heavy 
(respectively, light) loads, the balancing activity is primarily initiated by processors that are lightly 
(respectively, heavily) loaded. This activity is controlled by two system thresholds: MinTh and
MaxTh, which are the minimum and the maximum system load levels. The system load level,
SysLL, is the average number of jobs queued per processor. If a processor p has a queue length
QLen(p) < MinTh, load balancing is initiated. If, on the other hand, QLen(p) > MaxTh, the
extra jobs ExLoad = QLen(p) - MaxTh are distributed through the network, without explicit load
balancing. However, if this distribution overloads a processor in the final stage (stage 0), load 
balancing is triggered. 
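
The threshold rule can be summarized in a few lines. The sketch below is illustrative only: the function name and the returned tuples are assumptions, and the order of the two tests follows the Main procedure of Fig. 4.

```python
def balancing_action(qlen, min_th, max_th):
    """Decide what a processor with queue length qlen should do.

    Returns (action, ex_load), where action is one of
    "distribute", "balance", or "none".
    """
    if qlen > max_th:
        # Too many jobs: push the excess through the SBN
        # without starting an explicit balance operation.
        return ("distribute", qlen - max_th)
    if qlen < min_th:
        # Too few jobs: initiate a load-balancing operation.
        return ("balance", 0)
    return ("none", 0)
```

For instance, with MinTh = 2 and MaxTh = 24, a processor holding 30 jobs would distribute 6 of them, while a processor holding a single job would initiate balancing.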

The performance of our algorithms is affected by the chosen values for MinTh and MaxTh. For 
instance, MinTh must be large enough to receive sufficient jobs before a lightly-loaded processor 
becomes idle. However, it should not be so large that it initiates unnecessary load balancing. Similarly,
if MaxTh is too small, it will cause an excessive number of job distributions; if it is too large, jobs 
will not be adequately distributed under light system loads. Basically, once there is sufficient load 
on the network, very little load-balancing activity should be required.


3.3.2 Message Types 

Two types of messages are processed by the SBN-based algorithms. The first type is a balancing 
message which originates from an unbalanced processor and is then routed through the SBN. The 
cumulative total of queued jobs, TotalJQ, is computed to obtain a snapshot value of SysLL.

The second type of message is a distribution message, which is used during a load-balancing
operation to route the TotalJQ value through the network and to reassign jobs after the balancing
message is broadcast. Each processor upon receipt of such a message, updates its local values 
of SysLL, MinTh, and MaxTh. Distribution messages can also be sent to predecessor processors 
prior to completing the broadcast of the balancing message. This action occurs so that jobs can 
be assigned to predecessors having less than MinTh jobs queued and about to go idle. Finally, 
distribution messages are sent when a load-balancing operation is not in progress and a processor 
has greater than MaxTh jobs queued. In this case, ExLoad jobs are reassigned. 

If the communication from one processor to its neighbors is completed in constant time, a single
load-balancing operation requires $O(\log P)$ time since there are $d + 1 = (\log P) + 1$ communication
stages in SBN($d$). However, if multiple balancing operations are processed simultaneously, the
worst case complexity can be shown to be $O(\log^2 P)$ [5]. To reduce message traffic, a processor
does not initiate additional load-balancing activity until all previous balancing messages passing 
through it have been serviced. 

3.3.3 Common Procedures 

All three of our load-balancing algorithms consist of four key procedures. The first two, GetBalance
and GetDistribute, are used to process balancing and distribution messages that are received,
while the other two, Balance and Distribute, route those messages to the successors in the SBN.
Implementation details of these procedures depend on the particular load-balancing algorithm used.
Figure 4 presents the pseudo code that is common to all our SBN-based load-balancing algorithms, 
and is executed in parallel by every processor. The UpdateLoad procedure adjusts the system 
thresholds described in Section 3.3.1. It is called by the GetBalance and GetDistribute procedures 
when load-balancing operations complete. 

In our experiments, we found that ConstParam = 2 was the most effective
value. MinTh is then set so that load balancing will begin before processors go idle. The setting 
of MaxTh grows exponentially with SysLL because the need for load-balancing activity decreases 
rapidly as SysLL increases. The mathematical justification for this policy is presented in Sec- 
tion 4.4. 
Procedure Main
    Repeat forever
        Call GetBalance to process balancing messages received
        Call GetDistribute to process distribution messages received
        If (QLen(p) > MaxTh)
            ExLoad = QLen(p) - MaxTh
            Call Distribute to route ExLoad jobs through the SBN
        If (QLen(p) < MinTh)
            Call Balance to initiate a balance operation and determine TotalJQ
        Resume processing of application program
    End Repeat

Procedure UpdateLoad(TotalJQ)
    SysLL = ⌈TotalJQ / P⌉
    MaxTh = SysLL + 2^⌊SysLL / ConstParam⌋
    MinTh = SysLL - 1
    If (SysLL > ConstParam) MinTh = ConstParam
    Return


Figure 4: Common pseudo code for SBN-based load-balancing algorithms. 
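
As a rough illustration, the UpdateLoad procedure can be written as a small function. This sketch assumes that the rounding of TotalJQ / P is a ceiling and that the exponent of 2 is the floor of SysLL / ConstParam; the Python names are chosen for this example.

```python
def update_load(total_jq, p, const_param=2):
    """Return (SysLL, MaxTh, MinTh) for a total of total_jq queued jobs
    on p processors, following the UpdateLoad procedure."""
    sys_ll = -(-total_jq // p)                      # ceiling of total_jq / p
    max_th = sys_ll + 2 ** (sys_ll // const_param)  # grows exponentially with SysLL
    min_th = sys_ll - 1
    if sys_ll > const_param:
        min_th = const_param                        # cap MinTh under heavier loads
    return sys_ll, max_th, min_th
```

Under these assumptions, 64 queued jobs on 8 processors give SysLL = 8, MaxTh = 24, and MinTh = 2, which agrees with the updated values quoted in the Fig. 6 example of Section 4.1.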

4 SBN-Based Load-Balancing Schemes 

Based on the SBN concept, a baseline dynamic load-balancing technique and two variants are 
proposed in this paper. The Basic algorithm and the Hypercube Variant attempt to accurately 
compute and maintain the value of SysLL, whereas the Heuristic Variant estimates it. 


4.1 Basic SBN Algorithm 

In this algorithm, balancing messages are routed through SBN(d) from the source (root) at stage d 
to all the processors at stage 0. These messages are then routed back to the root so that TotalJQ 
can be computed. Thus, the originating root processor has an accurate snapshot value of SysLL. 
Next, distribution messages are sent to relocate jobs and to broadcast the TotalJQ value. All 
processors then update their local SysLL, MinTh, and MaxTh. The extra load of jobs (ExLoad) is
routed as part of this distribution to balance the system load. In addition, if QLen(p) < SysLL for
a processor p, the need for jobs is indicated during the distribution process. Successor processors 
respond by routing back an appropriate number of excess jobs, if available. Figure 5 presents the 
pseudo code of the Basic SBN algorithm, where Stage(p) denotes the SBN stage of processor p
for a given communication pattern (cf. Fig. 2) and JobsRecv is the number of jobs received by it 
in a distribution message. 

Procedure GetBalance
    While (balancing messages remain to be processed)
        If (QLen(processor sending this message) < MinTh)
            Route ⌊QLen(p) / 2⌋ jobs to processor sending this message
        If (no balance operation active from the SBN root processor)
            Increment count of simultaneous balance operations being serviced
            If (Stage(p) == 0) Call Balance to route QLen(p) value towards the root processor
            Else
                Indicate two balancing messages to be gathered from successors
                Call Balance to route balancing message to successors
        Else
            If (second balancing message remains to be gathered from successors) Continue
            If (Stage(p) == d)
                Call UpdateLoad(TotalJQ)
                ExLoad = QLen(p) - SysLL
                Call Distribute to route ExLoad jobs and TotalJQ value to successors
                Decrement count of simultaneous balance operations being serviced
            Else Call Balance to route (QLen(p) + QLen(successors)) value to the root processor
    End While
    Return

Procedure GetDistribute
    While (distribution messages remain to be processed)
        QLen(p) = QLen(p) + JobsRecv
        If (TotalJQ value contained in message received)
            Call UpdateLoad(TotalJQ)
            Route min(QLen(p) - SysLL, SysLL - QLen(predecessor)) jobs to predecessor
            Decrement count of simultaneous balance operations being serviced
        If (Stage(p) == 0)
            If (QLen(p) > MaxTh) Call Balance to initiate new balance operation
        Else
            ExLoad = QLen(p) - MaxTh
            If (this is a balance operation) ExLoad = QLen(p) - SysLL
            Call Distribute to route ExLoad jobs and/or TotalJQ value to successors
    End While
    Return

Procedure Balance
    If (Stage(p) == d)
        If (balance operation already underway) Return
        Increment count of simultaneous balance operations being serviced
        Indicate that a balancing message is expected from successor
    Route the balancing message to next SBN stage
    Return

Procedure Distribute
    If ((this is a non-balance distribution) And (balance operation already underway))
        Inhibit the distribution and Return
    QLen(p) = QLen(p) - ExLoad
    If ((ExLoad > 0) Or (TotalJQ value to send)) Route the distribution message to successors
    Return

Figure 5: Pseudo code for the Basic SBN algorithm.

To illustrate the operation of this algorithm, consider SBN(3) in Fig. 6(a). The label and
QLen(p) for each p are shown inside and outside the corresponding circle, respectively. For example,
processor 6 has Q = 3 jobs queued for execution. At processor 0, the initial values are SysLL = L = 4,
MinTh = m = 2, and MaxTh = M = 6. After a load-balancing request is sent through the SBN
and then routed back to processor 0, these values are updated as L = 8, m = 2, and M = 24, using
the UpdateLoad procedure given in Fig. 4. Note that when the balancing is initiated, processor 4
distributes half of its QLen(p) jobs (i.e., $\lfloor 3/2 \rfloor = 1$) back to processor 0. This distribution is shown
by the label on the arrow in Fig. 6(a).

Figure 6: An example of load balancing using the Basic SBN algorithm: (a) during balancing
messages, and (b) during distribution messages.

Distribution messages are then used to route excess jobs to the successor processors or to 
indicate a need for jobs if QLen(p) < SysLL. Jobs are routed back to the predecessors when 
appropriate. Figure 6(b) shows the result of this distribution where the labels on arrows indicate 
the number of jobs routed between processors. 

To balance $P$ processors, $P - 1$ balancing messages are first sent through the SBN, which are
then routed back to the root processor so that the SysLL value can be computed through a global
reduction operation. Finally, $P - 1$ distribution messages are sent to balance the load as well as
broadcast the TotalJQ value through the network. Note that, if a processor has an immediate need
for jobs that can be supplied during load balancing, additional distribution messages are sent from
neighboring processors to satisfy the need. If $J$ such messages are required, a total of $3P - 3 + J$
messages will have to be processed. Since $J = O(P)$, the total number of messages to be
processed is $O(P)$. The depth of the SBN network being $O(\log P)$, the total time required to
complete a load-balance operation is $O(\log P)$.
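
The balancing phase described above amounts to a global sum over the SBN pattern of the initiating processor. The following sketch simulates it sequentially, using the successor formulas and the exclusive-OR rule of Section 3.2; it ignores thresholds, concurrency, and job movement, and its function names are illustrative rather than part of the algorithm's pseudo code.

```python
def sbn_successors(node, stage, root, d):
    """Successors of `node` at `stage` in the SBN(d) pattern rooted at `root`."""
    if stage == 0:
        return []
    n = node ^ root                      # map into the root-0 template
    if stage == d:
        template = [(n + 2 ** (d - 1)) % 2 ** d]
    else:
        template = [n + 2 ** (stage - 1), n - 2 ** (stage - 1)]
    return [m ^ root for m in template]  # map back to the real labels


def balance_phase(qlen, root):
    """Return (TotalJQ, number of balancing messages) for one operation."""
    p = len(qlen)
    d = p.bit_length() - 1
    if p < 1 or 1 << d != p:
        raise ValueError("P must be a power of two")
    messages = 0

    def gather(node, stage):
        nonlocal messages
        total = qlen[node]
        for succ in sbn_successors(node, stage, root, d):
            messages += 2                # one message down, one routed back
            total += gather(succ, stage - 1)
        return total

    return gather(root, d), messages
```

For any root, the simulation uses 2(P - 1) balancing messages, which together with the P - 1 distribution messages gives the 3P - 3 term above.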

4.2 Hypercube Variant 

The SBN approach has been adapted for implementation on a hypercube topology, using the mod- 
ified binomial spanning tree as illustrated in Fig. 3. This algorithm uses the same control variables 
(SysLL, MinTh, and MaxTh), and processes balancing and distribution messages in the same manner 
as the Basic SBN algorithm. However, in embedding the SBN onto the hypercube, a processor r at
stage d sends a balance message to each of its adjacent processors at stage d - 1 of the hypercube.
These messages are then routed through the hypercube network and eventually collected at the
single destination processor q at stage 0. Processor q then accurately computes the current SysLL 
value and initiates the distribution process by routing distribution messages back towards the root 
processor at stage d. Note that the exclusive-OR property described in Section 3.2 still holds: it 
allows any processor to correctly determine the successors and predecessors if the stage number 
and the root processor are given. 

Other differences between the Basic SBN algorithm and the Hypercube Variant are as follows: 

• In the Hypercube Variant, the value of TotalJQ is computed when all balancing messages
arrive at the destination processor. Unlike the Basic algorithm, this is possible because
there is a unique destination for every originating processor in the hypercube embedding.
Distribution messages are then routed back to complete the load balancing. Since there are
$(P - 1) + (P/2 - 1)$ interconnections in the modified binomial spanning tree (cf. Fig. 3), a
load-balancing operation requires $3P - 4$ messages excluding the distribution messages sent
between neighbors to satisfy the immediate need for jobs.

• Balancing messages always proceed from the root processor (at stage d of SBN in the Hyper- 
cube Variant) towards stage 0. This contrasts with the Basic SBN algorithm where balancing 
messages first proceed from stage d to stage 0, and are then routed back to stage d. 

• To minimize the communication overhead in the Hypercube Variant, messages are gathered 
from the previous stage whenever more than one message is expected. The Basic SBN 
algorithm only needs to gather messages that are being routed back toward the root processor 
(because messages going in the other direction have only one predecessor). 

• The network topology for the Hypercube Variant is such that the numbers of predecessor and
successor processors vary at different communication stages, thereby somewhat complicating
the implementation.
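
The message count quoted in the first bullet can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes (this is not stated explicitly above) that every interconnection of the modified binomial spanning tree carries one balancing message and one distribution message, which is the reading under which 3P - 4 follows from the link count. The function names are hypothetical.

```python
def tree_links(num_procs):
    """Interconnections in the modified binomial spanning tree: (P - 1) + (P/2 - 1)."""
    return (num_procs - 1) + (num_procs // 2 - 1)


def messages_per_operation(num_procs):
    """Assumed: one balancing and one distribution message per link."""
    return 2 * tree_links(num_procs)


for num_procs in (4, 8, 16, 32):
    # Each value equals 3P - 4 under the assumption above.
    print(num_procs, tree_links(num_procs), messages_per_operation(num_procs))
```

For P = 32 this gives 46 links and 92 messages per load-balancing operation.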

4.3 Heuristic Variant 

Both the Basic algorithm and the Hypercube Variant are expensive since a large number of mes- 
sages has to be processed to accurately maintain the SysLL value. The Heuristic Variant attempts 
to reduce this overhead by terminating load-balancing operations initiated by processor p as soon 
as QLen(p) is sufficiently large. In general, this strategy reduces the number of messages although 
O(P) messages are still needed in the worst case. The pseudo code in Fig. 7 gives the operational 
details of the Heuristic Variant of the SBN-based load-balancing algorithm. 

As in the Basic algorithm, p initiates load balancing when QLen(p) < MinTh by sending a
balancing message to its SBN successor. However, in the Heuristic Variant, the processor r that
receives this message estimates SysLL by averaging local queue lengths over the processors through
which the balancing message has already passed. If QLen(r) > SysLL, an appropriate number
of jobs (ExLoad = ⌊QLen(r)/2⌋) is returned via a distribution message as shown in the
GetBalance procedure in Fig. 7. In this case, the load-balancing procedure is also terminated. If instead,
QLen(r) < SysLL, the balancing message is forwarded to the next SBN stage. Mathematical
justifications for the Heuristic Variant are discussed in the next subsection.

Procedure GetBalance
    While (balancing messages remain to be processed)
        TotalJQ = ⌈P × (QLen(p) + QLen(predecessors)) / (d - Stage(p) + 1)⌉
        OldSysLL = SysLL
        Call UpdateLoad(TotalJQ)
        ExLoad = ⌊QLen(p) / 2⌋
        If (ExLoad > 0)
            If (QLen(p) < SysLL)
                Call Balance to route balancing message to successors
            Else
                Call Distribute to route ExLoad jobs to predecessor
    End While
    Return

Procedure GetDistribute
    While (distribution messages remain to be processed)
        JobsQueued = QLen(p)
        QLen(p) = QLen(p) + JobsRecv
        If (distribution towards the root processor)
            TotalJQ = P × (JobsQueued + ⌈JobsRecv / (d - Stage(p) + 1)⌉)
            ExLoad = ⌊QLen(p) / 2⌋
        Else
            TotalJQ = P × (JobsQueued + ⌈JobsRecv / (2^(Stage(p)+1) - 1)⌉)
            ExLoad = QLen(p) - SysLL
        Call UpdateLoad(TotalJQ)
        If (ExLoad > 0)
            Call Distribute to route ExLoad jobs to next SBN stage
    End While
    Return

Procedure Balance
    If (Stage(p) == 0) Return
    If (Stage(p) == d)
        TotalJQ = P × QLen(p)
        Call UpdateLoad(TotalJQ)
    Route the balancing message to successors
    Return

Procedure Distribute
    If (Stage(p) == 0) Return
    If (Stage(p) == d)
        TotalJQ = P × (SysLL + ⌈(QLen(p) - SysLL) / P⌉)
        Call UpdateLoad(TotalJQ)
    ExLoad = QLen(p) - SysLL
    QLen(p) = QLen(p) - ExLoad
    Route ExLoad jobs with the distribution message to next SBN stage
    Return

Figure 7: Pseudo code for the Heuristic Variant.
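
The decision taken by a processor r on receiving a balancing message can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments. It assumes that SysLL is estimated as the rounded-up plain average of the queue lengths seen so far (including r itself), that a queue length equal to the estimate is treated like the overloaded case, and that a processor with nothing to spare (ExLoad = 0) simply forwards the message. The function names are hypothetical.

```python
import math


def estimate_sysll(queue_lengths):
    """Assumed estimate: ceiling of the mean queue length over visited processors."""
    return math.ceil(sum(queue_lengths) / len(queue_lengths))


def handle_balance(qlen_r, visited_qlens):
    """Return (action, jobs_returned, estimated_sysll) for receiving processor r.

    visited_qlens holds the queue lengths of the processors the balancing
    message has already passed through (the initiator first).
    """
    sys_ll = estimate_sysll(visited_qlens + [qlen_r])
    ex_load = qlen_r // 2  # ExLoad = floor(QLen(r) / 2)
    if ex_load > 0 and qlen_r >= sys_ll:
        # Enough local work: return jobs and terminate the balancing operation.
        return ("return", ex_load, sys_ll)
    # Otherwise pass the balancing message on to the next SBN stage.
    return ("forward", 0, sys_ll)


print(handle_balance(6, [0]))  # ('return', 3, 3)
print(handle_balance(1, [0]))  # ('forward', 0, 1)
```

Because each decision uses only the queue lengths carried along with the message, no gathering step is needed before a processor can act.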

Job distribution is also accomplished differently in the Heuristic Variant. If p determines that
QLen(p) > MaxTh, a job distribution is necessary. A new estimate of SysLL is calculated
by p as SysLL = SysLL + ⌈(QLen(p) - SysLL) / P⌉. It then tries to evenly distribute the
excess load among all the processors in the network. Each processor r receiving a resulting
distribution message with JobsRecv jobs updates its own SysLL and QLen(r) values. However, before
updating QLen(r), its SysLL is computed as SysLL = QLen(r) + ⌈JobsRecv / R⌉, where R is
the number of processors (including r) in the remaining SBN stages. Note that SysLL is updated
based on QLen(r) and not its original outdated value. Processor r, in turn, distributes jobs in excess
of its own updated SysLL to the next SBN stage. This is reflected in the formulae shown in the
GetDistribute procedure in Fig. 7.

As an illustration, consider an SBN(3) that has a processor p with SysLL = 7, MaxTh = 15, and
QLen(p) = 24. The newly computed SysLL value is 7 + ⌈(24 - 7)/8⌉ = 10. The number of jobs that
will be distributed to the successor (only one successor at this stage) is JobsRecv = 24 - 10 = 14.
Suppose that the successor r has SysLL = 9, and QLen(r) = 6. After receiving 14 jobs from p, it
has SysLL = 6 + ⌈14/7⌉ = 8. Thus, 20 - 8 = 12 excess jobs will be distributed to the next stage.
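
The numbers in this illustration can be reproduced with a short sketch. It only re-evaluates the two SysLL update formulas given above, rounding the quotient up; the function names are hypothetical and the code is not taken from the authors' implementation.

```python
import math


def root_sysll(sys_ll, qlen, num_procs):
    """New SysLL at the distributing processor: SysLL + ceil((QLen - SysLL) / P)."""
    return sys_ll + math.ceil((qlen - sys_ll) / num_procs)


def receiver_sysll(qlen_before, jobs_recv, remaining_procs):
    """SysLL at a receiver: QLen (before the transfer) + ceil(JobsRecv / R)."""
    return qlen_before + math.ceil(jobs_recv / remaining_procs)


# Values from the SBN(3) illustration: P = 8, SysLL = 7, QLen(p) = 24.
sysll_p = root_sysll(7, 24, 8)
jobs_sent = 24 - sysll_p
# Successor r holds 6 jobs; R = 7 processors remain (including r).
sysll_r = receiver_sysll(6, jobs_sent, 7)
excess_r = (6 + jobs_sent) - sysll_r
print(sysll_p, jobs_sent, sysll_r, excess_r)  # prints: 10 14 8 12
```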

A significant advantage of the Heuristic Variant is that the balancing messages do not have to 
be gathered in order to calculate SysLL. This reduces the interdependencies associated with the 
communication and allows fault tolerance. If a particular processor fails, load balancing can still 
be accomplished with the help of the remaining processors. 

An additional improvement for both the Basic SBN load-balancing algorithm and its Heuristic 
Variant can be obtained by using multiple communication patterns in the SBN. Each time a mes- 
sage is initiated, one of the SBN patterns is randomly chosen. Our experiments make use of the 
two communication patterns mentioned in Section 3.2 for computing predecessors and successors. 
Each of the balancing and distribution messages includes the source processor, the pattern used, 
and the stage to which the message is being routed. Since all processors have the SBN template 
associated with messages originating from processor 0, the required SBN communication pattern 
can be determined. 

4.4 Analysis of SBN Message Passing 

In a multicomputer consisting of P processors, we assume the arrival of jobs can be modeled by
a Poisson distribution such that the probability of a processor having j jobs is $\lambda^j / (e^{\lambda} j!)$, where
$\lambda$ is the mean arrival rate. If SysLL = k, then by definition, the average number of jobs assigned
to a processor is k. Hence, the probability that a processor has exactly j jobs is $k^j / (e^k j!)$.
This implies the probability, $g_j$, that a processor in the network has more than j jobs is given by
$g_j = 1 - \left(\sum_{i=0}^{j} k^i / i!\right) / e^k$. For example, the probability that all P processors have more than
three jobs is $(g_3)^P$. Now, $(g_3)^P > 0.9$ if k > 5, and is almost unity if k > 15. This implies that the
need for load-balancing activity rapidly decreases as SysLL increases. Therefore, it makes sense
to increase MaxTh exponentially with increasing SysLL.
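
A small sketch of the tail probability used in this argument is given below, assuming the Poisson model stated above with mean k = SysLL. The function names are hypothetical. The value of the P-processor probability depends on P, so the sketch takes P as a parameter rather than fixing it.

```python
import math


def prob_more_than(j, k):
    """g_j: probability that a Poisson(k) queue holds more than j jobs."""
    cdf = math.exp(-k) * sum(k ** i / math.factorial(i) for i in range(j + 1))
    return 1.0 - cdf


def prob_all_more_than(j, k, num_procs):
    """(g_j)^P: probability that all P independent processors hold more than j jobs."""
    return prob_more_than(j, k) ** num_procs


# The probability rises toward one as the system load level k grows.
for k in (5, 10, 15):
    print(k, prob_all_more_than(3, k, 32))
```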

To analyze the Heuristic Variant, let us compute the expected number of jobs, EJobs, that are 
returned to the processor that initiates the SBN load-balancing algorithm. We also compute the 
expected number of processors, EProcs, that will be visited during a balancing operation. In this 
way, we can determine whether sufficient jobs are returned and if message traffic is reduced by 
utilizing the Heuristic Variant. In the following, we mathematically model SBN message passing. 
For this purpose, we define the following four probability vectors: 

• $\Phi = (\phi_0, \phi_1, \ldots, \phi_n)$ defines the distribution of jobs queued at a given processor, where $\phi_i$
is the probability that the processor has i jobs queued for a specified value of SysLL.

• $\Phi_{stop} = (0, 0, \ldots, 0, \phi_{stop}, \phi_{stop+1}, \ldots, \phi_n)$ is the likelihood that balancing will terminate at
a given processor. Here, stop is the number of jobs queued at a processor that will prevent
the SBN Heuristic Variant from forwarding a balancing message to the next stage.

• $\Phi_{continue} = (\phi_0, \phi_1, \ldots, \phi_{stop-1}, 0, \ldots, 0)$ defines the likelihood that balancing messages
will be sent to successor processors.

• C(V) is the probability vector computed by applying the function C to the probability vector
V, and indicates the number of jobs distributed to a processor's predecessor. Here, V is the
vector defining the probabilities of jobs queued at a processor after receiving distribution
messages from its successors.

Proposition 1 If $V_1 = (v_{10}, v_{11}, \ldots, v_{1n})$ and $V_2 = (v_{20}, v_{21}, \ldots, v_{2m})$ are probability vectors,
then $V_1 \otimes V_2 = (v_0, v_1, \ldots, v_{m+n})$, where $v_i = \sum_{j+k=i} v_{1j} v_{2k}$, is also a probability vector.

Proof Consider the product $\left(\sum_{i=0}^{n} v_{1i}\right)\left(\sum_{j=0}^{m} v_{2j}\right)$. Since $V_1$ and $V_2$ are probability vectors,
$\sum_{i=0}^{n} v_{1i} = \sum_{j=0}^{m} v_{2j} = 1$ and the above product is unity. This product can also be written as
$v_{10}(v_{20} + v_{21} + \cdots + v_{2m}) + \cdots + v_{1n}(v_{20} + v_{21} + \cdots + v_{2m})$. All the terms in this Cartesian product
can be reorganized as $(v_{10} v_{20}) + (v_{10} v_{21} + v_{11} v_{20}) + (v_{10} v_{22} + v_{11} v_{21} + v_{12} v_{20}) + \cdots$, where the
expression within each pair of parentheses is a vector entry of $V_1 \otimes V_2$. Since we have shown that
this sum is unity, $V_1 \otimes V_2$ is also a probability vector. □
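
The operation in Proposition 1 is the discrete convolution of two probability vectors. A minimal sketch, with a hypothetical function name and made-up input vectors, shows that the entries of the result still sum to one:

```python
def convolve(v1, v2):
    """Entry i of the result is the sum of v1[j] * v2[k] over all j + k == i."""
    out = [0.0] * (len(v1) + len(v2) - 1)
    for j, a in enumerate(v1):
        for k, b in enumerate(v2):
            out[j + k] += a * b
    return out


# Illustrative vectors only; any two probability vectors would do.
v = convolve([0.5, 0.5], [0.25, 0.25, 0.5])
print(v, sum(v))  # prints: [0.125, 0.25, 0.375, 0.25] 1.0
```

Entry i of the result is the probability that the two independent job counts add up to i.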

Utilizing the above definitions, define $R_k = (j_{k0}, j_{k1}, \ldots, j_{kn})$ as a probability vector, where
$j_{ki}$ is the probability that i jobs from an SBN processor at stage k are returned to a predecessor
at SBN stage k + 1 during a balancing operation. Then, $R_{d-1}$ indicates the jobs returned to the
root processor that initiated the load-balancing operation, and can be computed by the following
recursive formula:

Stage 0: $R_0 = C(\Phi)$, and

Stage $1 \le k < d$: $R_k = C(\Phi_{stop} + \Phi_{continue} \otimes R_{k-1} \otimes R_{k-1})$,

where the function C reflects the SBN job return policy as charted in Fig. 8. Basically, the
calculation function is used to add the jobs queued locally to the jobs that are expected to be returned from
the two successor processors. Once $R_{d-1} = (r_0, r_1, \ldots, r_n)$ is computed, EJobs $= \sum_{j=0}^{n} (j \times r_j)$
can also be calculated.

To compute EProcs, let $\phi_c = \phi_0 + \phi_1 + \cdots + \phi_{stop-1}$ be the probability that a given processor
has c < stop jobs queued, so that it forwards the balancing message. The probability that a balancing
message will reach a processor at stage k < d in the SBN is therefore $\phi_c^{d-k-1}$, since the message must pass through
d - k - 1 predecessors. Since the number of processors at stage k < d in the SBN is $2^{d-k-1}$, we have
EProcs $= \sum_{k=0}^{d-1} (\phi_c^{d-k-1} \times 2^{d-k-1})$. For example, if d = 5 and $\phi_c = 0.4$, the expected number of
processors visited during a balancing operation is 3.3616.
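
The example value can be reproduced by summing, over the stages k = 0, ..., d - 1, the probability of reaching a stage multiplied by the number of processors at that stage. The sketch below is only a check of this arithmetic; the function and parameter names are hypothetical.

```python
def expected_procs(d, phi_c):
    """EProcs: sum over stages k < d of phi_c^(d-k-1) * 2^(d-k-1)."""
    return sum((phi_c ** (d - k - 1)) * (2 ** (d - k - 1)) for k in range(d))


print(round(expected_procs(5, 0.4), 4))  # prints: 3.3616
print(expected_procs(5, 1.0))            # prints: 31.0
```

With a forwarding probability of 1 the message always continues, and the sum reduces to the 31 non-root processors of a 32-processor SBN.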

From the above model, we can compute EJobs and EProcs for various values of the system
load level (SysLL), the dimension d of the SBN, and the value of stop. Assuming a Poisson job
arrival rate, the probability vector $\Phi$ is easily obtained. The vector C(V) can be calculated by
utilizing the job return policy of the Heuristic Variant.

Figure 8 plots EProcs and EJobs values for this mathematical model. Here, P = 32 and SysLL 
varies between 1 and 15. In the graphs, we analyze five prospective policies for a processor to ter- 
minate load balancing, and compare their expected performance against the Basic SBN algorithm. 
Recall that when a processor p executes the Basic algorithm after receiving a balancing message, 
this message is forwarded to the next SBN stage unless p is at stage 0. The five policies for p to 
stop balancing are: 

$\mathcal{P}_i$: when QLen(p) ≥ SysLL + 5 - i for 1 ≤ i ≤ 4,

$\mathcal{P}_5$: when QLen(p) ≥ 2 if SysLL < 2, or when QLen(p) ≥ 4 for other SysLL values.

Notice that policy $\mathcal{P}_5$ attempts to return at least one job under light system loads and at least two
jobs under heavier loads.
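
The five termination policies can be written as a single threshold function. The sketch below is an interpretation, not the authors' code: it assumes the comparisons are inclusive (a processor stops once its queue length reaches the threshold), which makes the fourth policy correspond to stop = SysLL + 1. The function names are hypothetical.

```python
def stop_threshold(policy, sys_ll):
    """Queue length at which a processor stops forwarding a balancing message."""
    if 1 <= policy <= 4:
        return sys_ll + 5 - policy
    if policy == 5:
        # Thresholds 2 and 4 return at least floor(2/2) = 1 or floor(4/2) = 2 jobs.
        return 2 if sys_ll < 2 else 4
    raise ValueError("policy must be between 1 and 5")


def stops_balancing(policy, qlen, sys_ll):
    """Assumed inclusive test: stop when QLen(p) >= threshold."""
    return qlen >= stop_threshold(policy, sys_ll)


for policy in range(1, 6):
    print(policy, stop_threshold(policy, 4))
```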

Figure 8(a) shows that if the Heuristic Variant is used with stop = SysLL + 1 (policy $\mathcal{P}_4$),
EProcs is significantly less than that of the Basic SBN algorithm. For example, if P = 32 and
SysLL = 4, only about 8 processors are visited on average using the Heuristic Variant, whereas all
31 processors are visited for the Basic algorithm. Furthermore, with the Heuristic Variant, Fig. 8(b)
indicates that EJobs = 4.2 (compared to 8.8 with the Basic algorithm), a value much closer to the
SysLL value of 4. Therefore, a better load balance is achieved with significantly reduced balancing
traffic.

Figure 8: Expected number of (a) processors visited (EProcs), and (b) jobs returned (EJobs) for
P = 32.

Figure 8(b) also gives an indication about optimal stop values with respect to SysLL. After a 
successful load-balancing operation, QLen(p) for each processor p should approximate the SysLL 
value. By the definition of SBN load-balancing algorithms, the processor that initiates load balancing
has fewer than MinTh jobs queued for processing. EJobs will then be optimal when it is almost
equal to SysLL. Figure 8(b) shows that this objective is achieved when stop = SysLL + 1. This is 
intuitively correct because it implies that a processor will forward a load-balancing message to the 
next SBN stage until an overloaded processor is encountered. 

5 Experimental Study 

This section describes the simulation environment and the test cases that were used to compare the 
proposed SBN-based load-balancing algorithms with several existing methods. The performance 
metrics used for comparison are explained, and comprehensive results obtained from simulation 
experiments are presented. 

5.1 Simulation Environment 

All the load-balancing algorithms were implemented using MPI and tested with synthetically
generated workloads on a 32-processor SGI Origin2000 located at NASA Ames Research Center.
We used MPI rather than OpenMP to implement the various algorithms since investigating the
programming paradigm is beyond the scope of this paper. The SBN strategy is topology and
architecture independent; thus it made sense to use a portable library like MPI. Furthermore, any benefit
provided by exploiting the Origin2000 shared-memory architecture would be equally available to
all the load-balancing schemes.

The simulation program spawns the appropriate number of child processes and creates the
desired SBN topology. A list of all process labels and an initial distribution of jobs are then routed
through the network. In addition to the initial load, each processor dynamically generates addi- 
tional jobs to be processed. Specifically, 10 job creation cycles are run where the number of jobs 
generated at each processor during a cycle follows a Poisson distribution. By randomly picking 
different values of $\lambda$ (the mean arrival rate), varying numbers of jobs are created. Therefore, both
heavy and light system load conditions are dynamically simulated. We have confined the scope of 
this paper to independent tasks only. Jobs are processed by “spinning” for the designated amount 
of time. The simulation terminates when all jobs have been processed. Note that as the number of 
processors increases, the workload also increases by the same factor. It is thus expected that the 
completion time should remain relatively constant if the algorithm scales efficiently. 

Our experiments compare the performance of the SBN-based load-balancing algorithms with 
six other commonly-used techniques. They are Random, Gradient, Receiver Initiated, Sender Ini- 
tiated, Adaptive Contracting, and Tree Walking, as summarized in Section 2. The Basic SBN 
scheme, its Heuristic Variant, and the Tree Walking algorithm utilize the SBN topology in their 
implementation. The SBN topology, being somewhat similar to the binary tree structure required 
by Tree Walking, provides a direct and fair comparison. The other algorithms and the SBN Hyper- 
cube Variant are implemented utilizing a hypercube topology. This also demonstrates the ability 
to embed the SBN approach onto another topology, and provides a fair comparison between the 
Hypercube Variant and other load-balancing methods. By comparing the Hypercube Variant and 
the Basic SBN algorithms, we can determine whether a change in topology results in any signif- 
icant performance difference. Finally, the same experiments are also performed without any load 
balancing, in order to have a reference point. All ten algorithms (six existing load balancers, three
proposed schemes, and no load balancing) are implemented using the same hardware and software
environment.

The results obtained from these experiments are shown in Figs. 9, 10, and 11. They were
compiled from repeating the simulations 10 times and averaging the results. We were limited to
10 runs by system usage limitations. Three separate load scenarios are considered as described
below:

Heavy System Load (cf. Fig. 9): An initial load of 10 jobs per processor is queued during the
first cycle of execution. Each execution cycle, including the first, is 1.0 sec in duration. At
the start of the remaining nine cycles, an average of 19.41 additional jobs are generated in
accordance with the formula $200 \lambda^j / (e^{\lambda} j!)$, where $\lambda$ and j are randomly varied between 1 and
10. The constant, 200, is large enough to ensure that the time required to process the created
jobs is approximately double the 10.0 secs of execution (1.0 sec for each execution cycle),
thereby guaranteeing an overloaded condition. Job durations average 0.1 sec, with
the longest jobs requiring 0.2 sec. The entire simulation requires 10.0 secs plus the time
needed to empty out all of the job queues. Optimal balancing for this experiment requires an
average of 1.0 sec to process the initial load, plus 19.41 × 9 × 0.1 secs for processing the
additional load cycles, amounting to 18.469 secs.


Transition from Heavy to Light System Load (cf. Fig. 10): An initial load of 50 jobs per
processor is queued to a small subset (log P) of processors during the first execution cycle.
This ensures the initial heavy load condition. Each execution cycle, including the first, is
4.0 secs in duration. At the start of the remaining nine cycles, an average of 12.96 additional
jobs are generated using the formula $260 \lambda^j / (e^{\lambda} j!)$, where the values of $\lambda$ and j are randomly
varied between 1 and 20. The constant, 260, is chosen so that a light load of jobs will be
generated at each execution cycle. Job durations average 0.2 sec, with the longest
jobs requiring 0.4 sec. If load balancing is effective, the entire simulation requires 40.0 secs
(4.0 secs for each execution cycle). Note that 10.0 secs is required to process the initial load,
plus 12.96 × 9 × 0.2 secs for processing the remaining cycles. This totals 33.328 secs,
leaving an average of 0.667 sec of idle time per cycle.

Light System Load (cf. Fig. 11): This experiment is similar to the previous one except that 
the initial load of jobs is very light. Specifically, an initial load of one job per processor is 
queued to a small subset (log P) of processors during the first cycle of execution. Therefore, 
a light system load exists throughout the experiment. The entire simulation requires 40.0 secs 
(4.0 secs for each execution cycle), if load balancing is effective. Note that 0.2 sec is required
to process the initial load plus 12.96 × 9 × 0.2 secs to process the remaining cycles. This
totals 23.528 secs, leaving an average of 1.647 secs of idle time per cycle.
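
The ideal processing times and idle times quoted for the three scenarios follow from simple arithmetic, reproduced in the sketch below. It uses only the figures stated above (initial load time, average jobs per cycle, average job duration, nine later cycles, ten cycles in total); the function names are hypothetical.

```python
def ideal_busy_secs(initial_secs, jobs_per_cycle, job_secs, later_cycles=9):
    """Time to process the initial load plus the jobs created in the later cycles."""
    return initial_secs + jobs_per_cycle * later_cycles * job_secs


def idle_per_cycle(total_secs, busy_secs, cycles=10):
    """Average idle time per execution cycle."""
    return (total_secs - busy_secs) / cycles


heavy = ideal_busy_secs(1.0, 19.41, 0.1)         # about 18.469 secs
transition = ideal_busy_secs(10.0, 12.96, 0.2)   # about 33.328 secs
light = ideal_busy_secs(0.2, 12.96, 0.2)         # about 23.528 secs

print(round(heavy, 3), round(transition, 3), round(light, 3))
print(round(idle_per_cycle(40.0, transition), 3),
      round(idle_per_cycle(40.0, light), 3))
```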

5.2 Performance Metrics 

The data and bar charts included in Figs. 9-11 measure the comparative performance of the various
load-balancing algorithms on a 32-processor SGI Origin2000. The X-axis of the bar charts shows
the number of processors used. The Y-axis tracks the following metrics:

Message Traffic Comparison: The total number of balancing and distribution messages that 
were exchanged during the simulation. 

Total Jobs Transferred: The total number of jobs that were transferred from one processor to 
another. If a job is transferred multiple times before execution, each transfer is individually 
counted. Note that it may have been appropriate to count multiple job transfers only once 
since an actual data transfer would incur most of the overhead. However, the Total Jobs 
Transferred metric as defined can be useful in that it gives an indication of the flexibility 
of an algorithm: its ability to adapt to a rapidly changing dynamic load environment. For
example, load-balancing algorithms that do not allow multiple transfers would be the least
flexible and thus expected to generate the smallest values for Total Jobs Transferred. Thus, 
a load-balancing scheme returning a high value of this metric is not necessarily undesirable. 

Maximum Variance in Idle Time: The difference in processing time (in secs) between the 
busiest processor and the least busy processor. 

Total Time to Complete: The total amount of elapsed time (in secs) before all jobs are fully 
processed. 
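
As a rough illustration of the last two metrics, the sketch below computes them from per-processor timings. The input lists are made-up example values, not measurements from the experiments, and the function names are hypothetical.

```python
def max_variance_in_idle_time(busy_secs):
    """Difference in processing time between the busiest and least busy processor."""
    return max(busy_secs) - min(busy_secs)


def total_time_to_complete(finish_secs):
    """Elapsed time until the last processor has finished all of its jobs."""
    return max(finish_secs)


# Hypothetical per-processor timings for a four-processor run.
busy = [38.0, 40.0, 31.5, 36.25]
finish = [39.0, 40.5, 40.0, 40.25]
print(max_variance_in_idle_time(busy), total_time_to_complete(finish))  # prints: 8.5 40.5
```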


5.3 Summary of Results 

As mentioned earlier, the Basic SBN-based load-balancing algorithm (in short, SBN), its Heuristic 
Variant (SBZ), and the Tree Walking algorithm (TWA) were implemented using the SBN topology, 
while the other load-balancing schemes were implemented assuming a hypercube topology. Recall 
that the Hypercube Variant of SBN (CUBE) utilizes the Basic SBN algorithm adapted for the 
hypercube. Analyzing CUBE, we can determine whether performance improvements are due to 
the proposed load-balancing algorithms or due to the SBN topology. The following paragraphs 
analyze each of the four performance metrics measured in our experiments. 

With respect to the Message Traffic Comparison metric, the Gradient algorithm (GRAD) gen- 
erates, by far, the largest amount of message traffic. The Receiver Initiated algorithm (RECV) 
also generates a large number of messages because of its tendency to become unstable under light 
system loads. Idle processors can flood the system with job request messages in situations where 
their neighbors do not have excess jobs to transfer. To alleviate this condition, we introduced a 
0.1 sec delay between job requests. Longer delays tend to reduce the load-balancing effective- 
ness of RECV under light loads. As expected, SBZ generates less message traffic than the other 
two SBN schemes. Likewise, all SBN-based algorithms incur less communication than TWA. In 
general, the algorithms that do the worst in terms of load balancing require little or no message
communication. For example, no load balancing (NOBAL) does not generate any message traffic,
confirming that some interaction among processors is necessary to balance a system load.

Next consider the Total Jobs Transferred metric. During the heavy system load test (cf. Fig. 9), 
the Total Jobs Transferred values are significantly lower than those in the other experiments (cf.
Figs. 10 and 11). This is due to the fact that during heavy loads, most processors are busy and 
not seeking additional jobs to process. Under the light system load test (cf. Fig. 11), the situation 
changes. Here, the SBN-based algorithms generate the largest values because of their tendency 
to pass jobs through more processors in order to satisfy those with low loads. It is important to 
realize that the data associated with jobs need to be transferred only once, just before a job is about
to execute. Note also that the SBN algorithms utilize bulk transfers in sending distribution messages
to relocate jobs. These characteristics reduce the negative effects of message latency and
minimize the additional overhead that is incurred. As would be expected, the Random (RAND) 
and Sender Initiated (SEND) algorithms that allow only one job transfer before execution, have 
the smallest Total Jobs Transferred values. However, an interesting result is that Adaptive Con- 
tracting (ACWN), which also does not allow jobs to be rerouted more than once, has a higher Total 
Jobs Transferred measure. This is due to the fact that ACWN does more work to balance the load 
initially (when jobs arrive) than the other algorithms. In fact, during the heavy system load test (cf. 
Fig. 9), the Total Jobs Transferred value for ACWN is among the highest.

When considering the Maximum Variance in Idle Time metric, NOBAL obviously performs 
the worst by far. RAND, although reducing the idle time, is much less effective than the others. 
SEND and ACWN have similar performance, which is somewhat better than RAND. Note that 
RAND, ACWN, and SEND do not allow multiple job migrations. This feature prevents these 
algorithms from efficiently adapting to a dynamic job execution environment. TWA shows a large 
imbalance under light system loads (cf. Fig. 11) that could stem from its lack of using a minimum 
processor workload threshold. 

The last metric is the Total Time to Complete. Recall that the amount of work is increased 
proportionately with the number of processors (e.g., twice as much processing is required for 
P = 8 than for P = 4). Under light loads (cf. Fig. 11), only NOBAL and TWA fail to complete
within the optimal amount of time (40.0 secs). This is consistent with the results discussed above 
where both approaches show a large variance in processor idle time. Similarly, under the heavy- 
to-light load test (cf. Fig. 10), most of the algorithms finish at near-optimal times (approximately 
40.0 secs). Here, only NOBAL and RAND could not process the job queues within the expected 
time. If we observe the values from the chart in Fig. 9 for the Total Time to Complete, the best 
average results for the heavy system load test were recorded by SBN, CUBE, and GRAD. 

To compare the overall performance of the SBN-based algorithms to the other approaches, first 
note that the performance of CUBE is very similar to that of SBN. This indicates that our compar- 
isons are fair even though the topology of SBN is different from that of the other algorithms. The 
similarity in performance between CUBE and SBN is not surprising since both topologies have a 
depth of log P and use the same basic balancing approach. 

To continue with our performance comparison, consider the heavy system load (cf. Fig. 9) 
experiment. In this test, Total Time to Complete is the most important metric. Clearly, NOBAL
and RAND are the worst performers, and are hence non-competitive. Looking at the average 
performance of the experiment, we find that SEND, ACWN, and SBZ are worse than the average 
performance of SBN and CUBE by 6% to 10%. 

We next look at the experimental results under the transition from heavy to light system load 
(cf. Fig. 10) and the light system load (cf. Fig. 11) scenarios for the remaining five algorithms
(GRAD, RECV, TWA, SBN, and CUBE). In these two tests, Message Traffic Comparison is the
most important metric since we expect the completion times to be near-optimal for all reasonable
load-balancing approaches. Figures 10 and 11 show that GRAD and RECV have significantly
greater message traffic than TWA and the SBN-based algorithms. 

While comparing TWA to the SBN algorithms, we also want to determine how much load- 
balancing overhead is incurred by TWA. This is important because TWA suspends application 
processing during balancing. Our results show that TWA spends between 3.23% and 4.86% of 
the execution time in balancing a network of 32 processors. With a network of 16 processors, 
the overhead varies between 0.73% and 1.08%. Based on the superlinear increase in the fraction 
of time spent by TWA in balancing the load, the overhead could potentially become intolerable 
for large networks of processors. By contrast, SBN hides all of the load-balancing time since 
processing is never suspended. 
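
The superlinear growth mentioned above can be seen by comparing the overhead fractions reported for 16 and 32 processors. The sketch below only divides the quoted percentages; the dictionary layout and function name are hypothetical.

```python
# TWA balancing overhead as a percentage of execution time: (low, high) per P.
overhead_pct = {16: (0.73, 1.08), 32: (3.23, 4.86)}


def growth_factors(small, large):
    """Ratios of the low and high overhead percentages between two network sizes."""
    low = overhead_pct[large][0] / overhead_pct[small][0]
    high = overhead_pct[large][1] / overhead_pct[small][1]
    return low, high


low, high = growth_factors(16, 32)
print(round(low, 2), round(high, 2))  # prints: 4.42 4.5
```

Doubling the number of processors multiplies the overhead fraction by more than four, well above the factor of two that linear growth would give.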

Based on the above analysis, we conclude that the SBN approach is a viable alternative and 
compares favorably with other load-balancing algorithms. All three SBN strategies are effective 
because global load information is obtained from all processors to balance the system load. Most 
of the other load-balancing algorithms work only locally with processors interacting with their im- 
mediate neighbors; therefore, the load information is likely to be less accurate. Even though the 
SBN approach is global, the load-balancing overhead is significantly reduced because the depth of 
an SBN of P processors is log P. Furthermore, the fact that SBN-based algorithms allow applica- 
tion processing to continue uninterrupted during load balancing makes them latency tolerant: most 
of the communication and data distribution overhead is hidden under processing. 


6 Conclusions 

In this paper, we have proposed three new load-balancing algorithms (Basic, Hypercube Variant, 
and Heuristic Variant) based on a topology-independent logical communication pattern among 
processors, called a symmetric broadcast network (SBN). A detailed experimental investigation 
with synthetic workloads showed that this approach to load balancing compares favorably with 
several other schemes. The metrics that measure the Maximum Variance in Idle Time and Total
Time to Complete demonstrate that all three algorithms are effective in balancing the system load
while optimizing the completion and idle times. The Message Traffic Comparison metric shows 
that use of the Heuristic Variant reduces the overhead associated with load balancing traffic when 
compared to the two other SBN-based algorithms. 

The Basic SBN algorithm has been extended and effectively applied to a dynamic adaptive-mesh
application to balance processor workloads while significantly reducing the data redistribution
costs [4]. This optimization was possible by overlapping processing and workload migration.
The latency-tolerance feature makes the SBN approach a natural choice for grid and cluster computing
environments consisting of heterogeneous computers. The SBN topology also provides
fault tolerance that would allow applications to continue correct execution while using resources 
that are constantly changing. These will be the focus of future research.